Note: This is meant to be both a "capstone project proposal" and a "capstone milestone report".
In [1]:
from __future__ import division  # must come before any other statement in the cell

import glob
import os

import numpy as np
import pandas as pd
import sqlalchemy as sqla
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn import linear_model
Data comes from two sources: yellow cab trip records from the NYC Taxi & Limousine Commission (TLC), and census-tract demographics from the 2010 census and the American Community Survey (ACS).
My goal is to predict the number of taxi dropoffs in an area based on demographic and other statistics about that area. Ideally, I would like to understand the demographics of taxicab users and how well taxicabs serve NYC residents. I may later expand my data to include green "boro cabs" (which are only allowed to pick up passengers outside of the Manhattan central business district; data also available from the TLC) and Uber cars (https://github.com/fivethirtyeight/uber-tlc-foil-response) to compare their demographics of usage.
If I do something with the Uber data, I plan to write up a story in the style of fivethirtyeight.com that I may submit to them, as they have done a series of stories on this Uber data.
I would also expect that a paper comparing taxi services would be useful for city government, which wants to ensure that there is good coverage throughout the city and across demographic groups.
I don't have a lot of RAM on my computer, so I can't hold a whole year's worth of individual trip data in memory (even a single month is a struggle). Instead, I read chunks of 300,000 trips, keep only the dropoff GPS coordinates, round them to 4 decimal places (so that there is actually some data reduction), and count dropoffs per rounded coordinate.
In [ ]:
path=os.path.expanduser("~/Documents/TaxiTripData/2013taxi_trip_data/")
files=glob.glob(path+'*.csv.zip.gz')
In [ ]:
chunk=int(3e5)  # rows per chunk, to keep memory use manageable
counts=pd.DataFrame()
counter=0  # initialized once, so the counts accumulate across all monthly files
for i in files:
    print(i)
    iterate=pd.read_csv(i,usecols=[12,13],iterator=True,chunksize=chunk,
                        names=['dropoff_longitude','dropoff_latitude'],header=0)
    for test in iterate:
        # keep only points in a rough NYC bounding box (helper sketched below)
        test=test[InNYCarea(test)]
        test['dropoff_latitude']=test['dropoff_latitude'].round(decimals=4)
        test['dropoff_longitude']=test['dropoff_longitude'].round(decimals=4)
        # running total of dropoffs per rounded (latitude, longitude) pair
        if counter==0:
            counts=test.groupby(['dropoff_latitude','dropoff_longitude']).size()
        else:
            counts=counts.add(test.groupby(['dropoff_latitude','dropoff_longitude']).size(),fill_value=0)
        counter+=1
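InNYCarea isn't defined in this notebook; it is just a coarse filter that throws out points far from NYC before the more expensive PostGIS matching. A minimal sketch of what it might look like (the bounding-box values here are rough assumptions, not taken from the original):
In [ ]:
def InNYCarea(df):
    """Boolean mask for rows whose dropoff lies in a rough NYC bounding box.

    The exact bounds used originally are not shown; these are approximate.
    """
    return (df['dropoff_longitude'].between(-74.3, -73.6) &
            df['dropoff_latitude'].between(40.4, 41.0))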
A note about this data: for some reason, I was unable to find the NYC government website for the taxicab data when I was first looking, and instead got data from someone else's FOIL request from last year. Thus, this data is for 2013. While it is relatively clean, the month of September is funky: it has about twice as many trips as every other month, and many trips repeat data that clearly shouldn't repeat. Nearly every trip shares an exact GPS coordinate, pickup time, or dropoff time with at least one other trip, but only about 1/4 appear to be exact copies of other trips.
It's possible that later data is cleaner (NYC now publishes an automated data visualizer for it, so it probably is), but I had already been working with this data, and the lack of one month probably won't make much difference. For the final analysis, I'll probably take a look at later data, or use it as a testing set.
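A quick sanity check on that September file could look roughly like this (a sketch I haven't run here; the filename and column names are my guesses at the FOIL data's layout, and only a few columns are read to stay within memory):
In [ ]:
# Sketch: estimate the fraction of exact duplicate trips (on these columns) in the odd September file.
sept=pd.read_csv(path+'trip_data_9.csv.zip.gz',
                 usecols=['pickup_datetime','dropoff_datetime','dropoff_longitude','dropoff_latitude'])
print(sept.duplicated().mean())  # fraction of rows identical to an earlier row on these columns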
Now I want to aggregate this data by census tract. Python's GIS tooling seems pretty rudimentary, so I used PostGIS, a GIS extension to PostgreSQL, to match GPS coordinates to census tracts. This problem is hard enough that I'm glad I can use somebody else's code to do it.
I created a SQL database containing the TIGER shapefiles of the NYC census tracts, indexed by FIPS code, a numeric code the census assigns to every geographical area it covers. To do this, I basically followed this PostGIS tutorial: http://workshops.boundlessgeo.com/postgis-intro/index.html.
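For reference, loading a TIGER tract shapefile into that nyctracts table looks roughly like this (a sketch; the shapefile name is a placeholder, -s 4269 sets the NAD83 SRID that TIGER files use, and -I builds a spatial index on the geometry column):
In [ ]:
import subprocess
# Sketch: load a TIGER census-tract shapefile into the nyctracts table (filename is a placeholder).
subprocess.call("shp2pgsql -s 4269 -I tl_2010_36_tract10.shp nyctracts | psql -d TaxiData", shell=True)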
Then, I can use the PostGIS ST_Contains function to match all the GPS coordinates to FIPS codes, like so:
In [ ]:
engine = sqla.create_engine('postgresql://myuser:notyours@localhost:5432/TaxiData',echo=False)
In [ ]:
counts.rename('counts').to_sql('abridged2013gpscounts',engine,if_exists='replace')  # name the Series so the SQL column is called 'counts'
In [ ]:
fipscounts=pd.read_sql_query(
    "SELECT c.geoid10 AS fipscodes, p.counts "
    "FROM abridged2013gpscounts AS p, nyctracts AS c "
    "WHERE ST_Contains(c.geom, ST_SetSRID(ST_MakePoint(p.dropoff_longitude, p.dropoff_latitude), 4269))",
    engine)
This has the added benefit of doing some automatic data cleaning, as it throws out any points that don't lie within NYC (sorry, middle of the Gulf of Guinea, I don't think you are really the most popular spot in NYC).
In [ ]:
fips2013counts=fipscounts.groupby('fipscodes').sum()
Then I can add it to a big SQL table of all the data I'm using, with the very creative and descriptive name lotsofdata.
In [ ]:
full=pd.read_sql('lotsofdata',engine)
full.set_index('index',inplace=True)
full['abridged2013ycdropoffs']=fips2013counts['counts']  # aligns on the FIPS-code index
full.to_sql('lotsofdata',engine,if_exists='replace')
I'm only using census tracts with a population of at least 1000. Most tracts below that are parks or ocean, the relative errors in their data are generally very high, and taxi dropoffs there are probably not representative of the resident population anyway.
In [2]:
engine = sqla.create_engine('postgresql://postgres:postgres@localhost:5432/TaxiData',echo=False)
full=pd.read_sql_query('SELECT a.* FROM lotsofdata AS a WHERE a.totalpopulation>=1000',engine).set_index('fipscodes')
For this analysis, I'll be looking at three columns of ACS data, with per-capita dropoffs as my dependent variable, i.e. the dropoffs computed above divided by the population given by the census.
This uses the 2010 census (the most recent) for population, and 5-year American Community Survey data covering 2007-2011. I figure the ACS window splits the difference between my 2010 population data and the 2013 dropoff data.
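For completeness, the per-capita dropoff column used below (abridged2013ycdrpoffpc) isn't computed anywhere in this notebook; it was presumably produced along these lines:
In [ ]:
# Sketch: per-capita dropoffs = each tract's dropoff count divided by its 2010 census population,
# stored back into lotsofdata alongside the other columns.
full['abridged2013ycdrpoffpc']=full['abridged2013ycdropoffs']/full['totalpopulation']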
The census uses some funky abbreviations for its column headers, and I haven't changed them. Here's a description:
MRUE001: per-capita income
MOGE001: total commuters in the tract
MOGE011: commuters who drive to work
From the last two, I compute the ratio of non-driving commuters to total commuters, labeled nondrivercommuterrat.
In [3]:
subset=full[['MOGE001','MOGE011','MRUE001','abridged2013ycdrpoffpc']].copy()
# ratio of non-driving commuters to all commuters in the tract
subset['nondrivercommuterrat']=(subset['MOGE001']-subset['MOGE011'])/subset['MOGE001']
# tracts with zero commuters produce inf; drop them along with any other missing values
subset.replace([np.inf,-np.inf],np.nan,inplace=True)
subset.dropna(inplace=True)
subset.head()
Out[3]:
In [40]:
# Plotting the data:
#plt.rc('text', **{'usetex':True,'latex.preamble':[
# r'\usepackage{siunitx}',
# r'\sisetup{detect-all}',
# r'\usepackage{helvet}',
# r'\usepackage{sansmath}',
# r'\sansmath'
#] })
# Plot dropoffs per capita vs per-capita income, colored by the non-driver commuter ratio.
matplotlib.rc('legend', fontsize=10)
a=6
gr=(1+np.sqrt(5))/2  # golden-ratio aspect for the figure
plt.figure(figsize=[a*gr,a])
plt.autoscale(tight=True)
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
a=plt.scatter(subset['MRUE001'],subset['abridged2013ycdrpoffpc'],alpha=0.5,label='Data',c=subset['nondrivercommuterrat'],cmap=plt.cm.get_cmap('viridis'))
cbar=plt.colorbar(a)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
#plt.legend(loc='upper left')
plt.show()
There appears to be a rough linear relationship between the log of dropoffs per capita and the log of per-capita income, but it's not that strong. There is a funky region that almost looks like it has a negative slope, but the coloring shows that this region tends to have a lower non-driver commuter ratio.
In [39]:
# Split the tracts at a non-driver commuter ratio of 0.8 and plot both groups.
comd=0.8
asub=subset[subset['nondrivercommuterrat']>comd]
bsub=subset[subset['nondrivercommuterrat']<comd]
xdata=asub['MRUE001'].values
ydata=asub['abridged2013ycdrpoffpc'].values
matplotlib.rc('legend', fontsize=10)
a=6
gr=(1+np.sqrt(5))/2
plt.figure(figsize=[a*gr,a])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='both', tight=True)
# low-ratio tracts as faint hollow points in the background
plt.scatter(bsub['MRUE001'],bsub['abridged2013ycdrpoffpc'],facecolors='none',edgecolors='black',
            alpha=0.1,label='Data, non-driver commuter ratio $<$ 0.8')
# high-ratio tracts colored by their exact ratio
a=plt.scatter(xdata,ydata,alpha=0.5,label='Data, non-driver commuter ratio $>$ 0.8',
              c=asub['nondrivercommuterrat'],cmap=plt.cm.get_cmap('cool'))
cbar=plt.colorbar(a)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()
If we look only at the tracts with a high non-driver commuter ratio, the linear relationship is very clear; the funky corner consists entirely of tracts with lower non-driver commuter ratios.
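To be explicit, the fit below is ordinary least squares on the logs of both variables, $\log(\mathrm{dropoffs\ per\ capita}) = \beta \log(\mathrm{income}) + c$, which is equivalent to the power law $\mathrm{dropoffs\ per\ capita} = e^{c}\,\mathrm{income}^{\beta}$; that power law is the red curve overlaid on the plot after the fit summary.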
In [6]:
# OLS on the logs (high-ratio tracts only): log(dropoffs per capita) ~ log(per-capita income)
model=sm.OLS(np.log(asub['abridged2013ycdrpoffpc']),sm.add_constant(np.log(asub['MRUE001']),prepend=False))
fit=model.fit()
fit.summary()
Out[6]:
In [34]:
# Same plot as above, now with the power-law fit overlaid on the high-ratio tracts.
comd=0.8
asub=subset[subset['nondrivercommuterrat']>comd]
bsub=subset[subset['nondrivercommuterrat']<comd]
xdata=asub['MRUE001'].values
ydata=asub['abridged2013ycdrpoffpc'].values
XX=np.linspace(xdata.min(),xdata.max(),num=100)  # income grid for drawing the fit line
matplotlib.rc('legend', fontsize=10)
a=6
gr=(1+np.sqrt(5))/2
plt.figure(figsize=[a*gr,a])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='y', tight=True)
a=plt.scatter(xdata,ydata,alpha=0.5,label='Data, non-driver commuter ratio $>$ 0.8',
              c=asub['nondrivercommuterrat'],cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'],bsub['abridged2013ycdrpoffpc'],facecolors='none',edgecolors='black',
            alpha=0.1,label='Data, non-driver commuter ratio $<$ 0.8')
# back-transform the log-log fit: y = exp(const) * x**slope
plt.plot(XX,(XX**fit.params['MRUE001'])*np.exp(fit.params['const']),color='red',alpha=0.5,lw=2,
         label='Fit, $\mathsf{R^2}=$ ' + str(round(fit.rsquared,2)))
cbar=plt.colorbar(a)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()
We've got a fit that explains most of the variance, with an $R^2$ of 0.689. I suspect that I won't be able to explain much more of the variance in this subset (non-driver commuter ratio > 0.8), but maybe I will.
However, there is still lots to do. I should probably do an error analysis on this data, to see whether there is really any more variance that could be explained (much of it might be due to random fluctuations over time). I also want to do some analysis to better understand the causal relationship between these two variables. Many of these dropoffs are not representative of the resident population, but are instead common destinations, e.g. Grand Central and Penn Station, or the area around Central Park. In that case, it is the fact that rents are high around these common destinations that drives the income relationship, not that the rich are more likely to take cabs.
Of course, there is also the rest of the data left to explain. Using just per-capita income as the independent variable, there doesn't seem to be much correlation for the remaining tracts (see the blob below), and I will need to explore other variables. I might want to use some more advanced machine-learning techniques to figure out which data is useful (see the sketch after the plot below).
In [36]:
# The complement: tracts with a non-driver commuter ratio below 0.8, high-ratio tracts faded.
comd=0.8
asub=subset[subset['nondrivercommuterrat']<comd]
bsub=subset[subset['nondrivercommuterrat']>comd]
xdata=asub['MRUE001'].values
ydata=asub['abridged2013ycdrpoffpc'].values
matplotlib.rc('legend', fontsize=10)
a=6
gr=(1+np.sqrt(5))/2
plt.figure(figsize=[a*gr,a])
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='both', tight=True)
a=plt.scatter(xdata,ydata,alpha=0.5,label='Data, non-driver commuter ratio $<$ 0.8',
              c=asub['nondrivercommuterrat'],cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'],bsub['abridged2013ycdrpoffpc'],facecolors='none',edgecolors='black',
            alpha=0.1,label='Data, non-driver commuter ratio $>$ 0.8')
cbar=plt.colorbar(a)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()
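As a first pass at that variable hunt, here is a sketch (not something I've run) of using a cross-validated lasso to screen ACS columns; the candidate column list is a placeholder for whatever variables end up in lotsofdata, and the target is the log of per-capita dropoffs:
In [ ]:
from sklearn.preprocessing import StandardScaler

# Sketch: lasso-based screening of candidate ACS columns against log dropoffs per capita.
# 'candidate_cols' is a placeholder; swap in whichever columns are actually in lotsofdata.
candidate_cols=['MRUE001','MOGE001','MOGE011']
data=full[candidate_cols+['abridged2013ycdrpoffpc']].replace([np.inf,-np.inf],np.nan).dropna()
data=data[data['abridged2013ycdrpoffpc']>0]  # the log needs positive dropoff rates
X=StandardScaler().fit_transform(data[candidate_cols])
y=np.log(data['abridged2013ycdrpoffpc'])
lasso=linear_model.LassoCV(cv=5).fit(X,y)
# coefficients shrunk to (near) zero suggest columns that add little beyond the others
print(dict(zip(candidate_cols,lasso.coef_)))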